5000 Fastest Growing Private Companies in the U.S.

Summary statistics

Preliminary look at the Inc. 5000 Company List data set.

List of all variables in the dataframe

##  [1] "row_num"     "id"          "rank"        "workers"     "company"    
##  [6] "url"         "state_l"     "state_s"     "city"        "metro"      
## [11] "growth"      "revenue"     "industry"    "yrs_on_list"

Dimensions of the dataframe

5000 companies observed over 14 variables listed above

## [1] 5000   14

Structure of dataframe with preview of data values

## 'data.frame':    5000 obs. of  14 variables:
##  $ row_num    : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ id         : int  22890 25747 25643 26098 26182 22913 22937 25413 26079 25861 ...
##  $ rank       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ workers    : int  227 191 145 62 92 50 129 130 264 11 ...
##  $ company    : Factor w/ 5000 levels "(add)ventures",..: 1725 3569 3651 4211 79 3520 1094 3357 4703 1826 ...
##  $ url        : Factor w/ 5000 levels "@properties",..: 1725 3569 3647 4211 76 3520 1094 3357 4703 1826 ...
##  $ state_l    : Factor w/ 51 levels "Alabama","Alaska",..: 5 5 48 5 22 20 5 3 38 34 ...
##  $ state_s    : Factor w/ 51 levels "AK","AL","AR",..: 5 5 47 5 20 22 5 4 38 28 ...
##  $ city       : Factor w/ 1352 levels "Acton","Ada",..: 355 355 37 930 737 46 1142 1103 979 1325 ...
##  $ metro      : Factor w/ 326 levels "","Adrian MI",..: 171 171 314 262 39 163 261 226 229 321 ...
##  $ growth     : num  158957 57348 55460 26043 20690 ...
##  $ revenue    : num  195640000 82640563 85076502 35293000 77652360 ...
##  $ industry   : Factor w/ 25 levels "Advertising & Marketing",..: 5 11 2 23 24 7 13 13 25 7 ...
##  $ yrs_on_list: int  2 1 1 1 1 2 2 1 1 1 ...

Explore factor variables and the different levels in State and Industry

State

##  [1] "Alabama"              "Alaska"               "Arizona"             
##  [4] "Arkansas"             "California"           "Colorado"            
##  [7] "Connecticut"          "Delaware"             "District of Columbia"
## [10] "Florida"              "Georgia"              "Hawaii"              
## [13] "Idaho"                "Illinois"             "Indiana"             
## [16] "Iowa"                 "Kansas"               "Kentucky"            
## [19] "Louisiana"            "Maine"                "Maryland"            
## [22] "Massachusetts"        "Michigan"             "Minnesota"           
## [25] "Mississippi"          "Missouri"             "Montana"             
## [28] "Nebraska"             "Nevada"               "New Hampshire"       
## [31] "New Jersey"           "New Mexico"           "New York"            
## [34] "North Carolina"       "North Dakota"         "Ohio"                
## [37] "Oklahoma"             "Oregon"               "Pennsylvania"        
## [40] "Puerto Rico"          "Rhode Island"         "South Carolina"      
## [43] "South Dakota"         "Tennessee"            "Texas"               
## [46] "Utah"                 "Vermont"              "Virginia"            
## [49] "Washington"           "West Virginia"        "Wisconsin"

Industry

##  [1] "Advertising & Marketing"      "Business Products & Services"
##  [3] "Computer Hardware"            "Construction"                
##  [5] "Consumer Products & Services" "Education"                   
##  [7] "Energy"                       "Engineering"                 
##  [9] "Environmental Services"       "Financial Services"          
## [11] "Food & Beverage"              "Government Services"         
## [13] "Health"                       "Human Resources"             
## [15] "Insurance"                    "IT Services"                 
## [17] "Logistics & Transportation"   "Manufacturing"               
## [19] "Media"                        "Real Estate"                 
## [21] "Retail"                       "Security"                    
## [23] "Software"                     "Telecommunications"          
## [25] "Travel & Hospitality"

Summary of the data set

##     row_num           id             rank         workers     
##  Min.   :   0   Min.   :    4   5000   :   1   Min.   :    0  
##  1st Qu.:1250   1st Qu.:19575   4999   :   1   1st Qu.:   24  
##  Median :2500   Median :23292   4998   :   1   Median :   50  
##  Mean   :2500   Mean   :20037   4997   :   1   Mean   :  209  
##  3rd Qu.:3749   3rd Qu.:25370   4996   :   1   3rd Qu.:  125  
##  Max.   :4999   Max.   :26620   4995   :   1   Max.   :34219  
##                                 (Other):4994                  
##            company                 url             state_l    
##  (add)ventures :   1   @properties   :   1   California: 694  
##  @Properties   :   1   110-consulting:   1   Texas     : 404  
##  110 Consulting:   1   123stores     :   1   New York  : 335  
##  123Stores     :   1   180           :   1   Florida   : 303  
##  180           :   1   180fusion     :   1   Virginia  : 284  
##  180Fusion     :   1   1seocom       :   1   Illinois  : 238  
##  (Other)       :4994   (Other)       :4994   (Other)   :2742  
##     state_s            city                metro          growth         
##  CA     : 694   New York : 178   New York City: 399   Min.   :    42.45  
##  TX     : 404   Chicago  :  95   Washington DC: 316   1st Qu.:    84.21  
##  NY     : 335   Atlanta  :  94   Los Angeles  : 274   Median :   151.72  
##  FL     : 303   Austin   :  87   Chicago      : 224   Mean   :   516.44  
##  VA     : 284   San Diego:  80   Atlanta      : 194   3rd Qu.:   347.65  
##  IL     : 238   Houston  :  76   Dallas       : 169   Max.   :158956.91  
##  (Other):2742   (Other)  :4390   (Other)      :3424                      
##     revenue                                   industry     yrs_on_list    
##  Min.   :   1953000   IT Services                 : 733   Min.   : 1.000  
##  1st Qu.:   4876791   Advertising & Marketing     : 453   1st Qu.: 1.000  
##  Median :  10722077   Business Products & Services: 435   Median : 2.000  
##  Mean   :  43058182   Health                      : 377   Mean   : 2.744  
##  3rd Qu.:  26952131   Software                    : 338   3rd Qu.: 4.000  
##  Max.   :5528202691   Financial Services          : 278   Max.   :12.000  
##                       (Other)                     :2386

Initial Observations from a summary of the data set

  • There are 5000 companies ranked from 1 to 5000 based on their percentage growth in 2014, from greatest rate of growth (ranked 1) to slowest rate of growth (ranked 5000).
    • Greatest rate of growth is 158956.91%, lowest is 42.45%
  • The 51 levels for state cover 49 states (Wyoming does not appear in the list), the District of Columbia, and one territory (Puerto Rico).
  • The minimum number of workers is 0 (need to explore further how it is possible to have no employees) and the maximum is 34219. At least three quarters of the companies on the list have 125 or fewer employees.
  • The top 5 states with the greatest number of companies on the list are: California, Texas, New York, Florida, and Virginia. But the top 5 cities with the greatest number of companies on the list are: New York, Chicago, Atlanta, Austin, and San Diego. It may be worth figuring out why the top states and top cities don’t match.
  • The industries with the most companies on the list are: IT Services, Advertising & Marketing, Business Products & Services, Health, and Software.
  • For at least half of the companies, it is their first or second year on the list (the median is 2). About a quarter have been on the list 4 or more times, with 12 years being the highest number of years any one company has been on the list.

Univariate Plots Section

Histogram of states where companies are located

Histograms of workers by count

The bin width in the first plot is too large to show the trend. Reducing the bin width shows a histogram that skews right. What happens to the distribution if I perform a log10 transformation?

Transforming the long tail by taking the log10 of workers helps better understand the distribution of workers. The transformed workers distribution looks close to a normal distribution with a longer tail on the right.
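As a rough sketch of the transformation described above: the values below are a handful of worker counts taken from the outputs in this report, and the bin breaks are an assumption rather than the ones used for the plots. Note that log10 of a zero count is -Inf, so companies reporting 0 workers drop out of a log-scaled histogram.

```r
# A few worker counts seen in the outputs above (not the full data set)
workers <- c(0, 11, 24, 50, 62, 92, 125, 227, 34219)

# log10 compresses the long right tail; log10(0) is -Inf
log_workers <- log10(workers)
finite <- is.finite(log_workers)

# Histogram counts on the log scale; the breaks are illustrative only
h <- hist(log_workers[finite], breaks = seq(0, 5, by = 0.5), plot = FALSE)

print(sum(!finite))  # number of companies lost to the transformation
print(h$counts)
```

If the zero counts matter for the analysis, one common workaround is to plot log10(workers + 1) instead, at the cost of slightly shifting the small values.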

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      24      50     209     125   34220

Distribution of industry

The top industry is IT Services with 733 companies, the most represented industry by a large margin. The next two industries are Advertising & Marketing (453 companies) and Business Products & Services (435 companies), each roughly 60% of the IT Services count.

Distribution of revenue

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

The revenue distribution is strongly skewed right with a very long tail. A log10 transformation and a smaller bin width provide a more natural way to see the revenue data and illustrate trends in it. However, even after a log10 transformation, the data is still skewed to the right. Removing extremely high revenue outliers helps show a more normal distribution.

Part of the reason the distribution doesn’t look entirely normal is that the log-normal distribution looks truncated on the left side. This is likely due to the dataset containing only the top 5000 companies. If the data extended to 10,000 companies, for example, the curve would likely look more normally distributed.

Distribution of growth

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     42.45     84.21    151.70    516.40    347.70 159000.00

The long right tail justifies a log10 transformation.

The distributions of growth and revenue look really similar. Let’s try another type of plot to tease apart how the distributions differ. The frequency polygon plot better shows the different shapes of the distributions. The amount of growth is based on revenue generated, so it is not surprising that the two distributions are similar: they are highly correlated.

Distribution of Growth vs. Revenue

## Loading required package: grid

Many of the highest ranked companies are small businesses. This could be because smaller companies grow faster than big public companies. But it could also be that smaller companies start from smaller amounts of revenue. Absolute growth in dollars is different from percentage growth. For example, a company with no revenue the previous year that gains some revenue the next year has infinite percentage growth. But this isn’t a good reflection of how much revenue the company is generating compared to another company that makes more in absolute revenue but has a lower percentage growth.

I created two new variables. The first, revenue 2013, is last year’s revenue, derived from current revenue and percentage growth. The second, growth in dollars, is revenue 2014 minus revenue 2013.

## [1] 123000 143853 153125 135000 373500 690697
## [1] 195517000  82496710  84923377  35158000  77278860 137286506
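The two derived variables can be sketched as follows. This assumes growth is a percentage increase over the prior year's revenue, which is how the report treats it and which reproduces the first values printed above. The column names follow the renamed columns that appear in the later structure output, and the helper name prior_revenue is hypothetical.

```r
# Back out the earlier revenue from current revenue and percentage growth:
#   revenue_now = revenue_prior * (1 + growth_pct / 100)
prior_revenue <- function(revenue_now, growth_pct) {
  revenue_now / (1 + growth_pct / 100)
}

# First three rows of the data; growth for rows 2 and 3 is rounded in the printout
companies <- data.frame(
  revenue2014       = c(195640000, 82640563, 85076502),
  growth_percentage = c(158956.91, 57348, 55460)
)

companies$revenue2013   <- prior_revenue(companies$revenue2014,
                                         companies$growth_percentage)
companies$growth_dollar <- companies$revenue2014 - companies$revenue2013

print(round(companies$revenue2013))
print(round(companies$growth_dollar))
```

For the first row this gives a 2013 revenue of about 123,000 and growth of about 195,517,000 dollars, in line with the printed output. A company with zero prior revenue cannot be handled this way, since its percentage growth is undefined.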

Population Dataset

There is a limitation in my data set. Without resident population data for each state, city, or metro area, it is hard to determine whether the states with the most fast-growing companies have them simply because more people live there or because something about the state fosters growth. Therefore, I looked for population data from the U.S. Census Bureau and found population estimates for 2010 to 2014. These pair with the 2014 company data and the reverse-engineered 2013 revenue and growth numbers I calculated.

The structure of the new dataset of state population data:

##   Geographic_Area Census_April1 Estimate_Base    Est_2010    Est_2011
## 1   United States   308,745,538   308,758,105 309,347,057 311,721,632
## 2       Northeast    55,317,240    55,318,348  55,381,690  55,635,670
## 3         Midwest    66,927,001    66,929,898  66,972,390  67,149,657
## 4           South   114,555,744   114,562,951 114,871,231 116,089,908
## 5            West    71,945,553    71,946,908  72,121,746  72,846,397
## 6         Alabama     4,779,736     4,780,127   4,785,822   4,801,695
##      Est_2012    Est_2013    Est_2014
## 1 314,112,078 316,497,531 318,857,056
## 2  55,832,038  56,028,220  56,152,333
## 3  67,331,458  67,567,871  67,745,108
## 4 117,346,322 118,522,802 119,771,934
## 5  73,602,260  74,378,638  75,187,681
## 6   4,817,484   4,833,996   4,849,377

Using dplyr, I can create a new dataset that aggregates all growth and revenue numbers for companies by state and calculates the growth per capita.

The variables and structure of the new dataset.

## [1] "state_l"              "state_growth_dollar"  "state_population2014"
## [4] "growth_per_capita"
## 'data.frame':    51 obs. of  4 variables:
##  $ state_l             : Factor w/ 51 levels "Alabama","Alaska",..: 5 45 14 33 36 48 10 22 23 44 ...
##  $ state_growth_dollar : num  18309472149 17237538957 7298472080 6272947980 6133911042 ...
##  $ state_population2014: num  38802500 26956958 12880580 19746227 11594163 ...
##  $ growth_per_capita   : num  472 639 567 318 529 ...
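A minimal sketch of this kind of dplyr aggregation is shown below. The growth_dollar values are invented for illustration, the two population figures are the 2014 values shown in the structure output above, and the exact pipeline used for this report may differ.

```r
suppressPackageStartupMessages(library(dplyr))

# Toy company rows; growth_dollar values are made up for illustration
companies <- data.frame(
  state_l       = c("California", "California", "Texas"),
  growth_dollar = c(3e6, 1e6, 2e6),
  stringsAsFactors = FALSE
)

# 2014 population values as they appear in the output above
population <- data.frame(
  state_l              = c("California", "Texas"),
  state_population2014 = c(38802500, 26956958),
  stringsAsFactors = FALSE
)

state_growth <- companies %>%
  group_by(state_l) %>%                                      # one group per state
  summarise(state_growth_dollar = sum(growth_dollar)) %>%    # total growth per state
  inner_join(population, by = "state_l") %>%                 # attach population
  mutate(growth_per_capita = state_growth_dollar / state_population2014) %>%
  arrange(desc(growth_per_capita))

print(state_growth)
```

An inner join silently drops any state that is missing from either table, so it is worth checking that the result still has the expected number of rows (51 in this report).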

Univariate Analysis

What is the structure of your dataset?

I have two datasets. The original dataset is a list of the 5000 fastest growing private companies in 2014 in the U.S. from Inc. 5000. The second dataset I have is state population data from the Census Bureau. I have two resulting data frames: companies is the Inc. 5000 data set with new variables added, and state_growth is population data with additional variables.

What is/are the main feature(s) of interest in your dataset?

The variables most interesting to explore are the growth in percentage and dollar amounts since the dataset from Inc. 5000 is specifically about the fastest growing private companies in the U.S. I am also very interested in the industry the companies are in.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Revenue will be an important way to understand growth. For example, a company with small revenue will see greater gains in percentage growth than a company with larger revenue, but the latter could have much greater revenue and growth in absolute dollar amounts. So it is critical to interpret growth in light of revenue.

State population data is also important to better understand growth. A larger state might appear to have greater growth in absolute dollar amounts but that could be influenced by a greater population. Therefore investigating growth per capita can provide a fairer way to look at growth, especially from the point of view of smaller states.

Did you create any new variables from existing variables in the dataset?

I created 4 new variables from existing variables across two datasets. Two are in the companies data frame: 1. revenue2013 and 2. growth_dollar. I reverse-engineered revenue from 2013 using revenue from 2014 and percentage growth. Then I subtracted the 2013 revenue from 2014 revenue to get growth_dollar.

I also created a new dataframe using the state population data from the census. In this dataframe, I added two other variables: 3. state_growth_dollar and 4. growth_per_capita. state_growth_dollar was calculated by grouping companies by state and summing growth_dollar (the 2nd variable I created). growth_per_capita was created by dividing state_growth_dollar by the state population.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The revenue, growth, and workers histograms all skewed right with a very long tail. I had to perform a log transformation to better understand the data. I also performed a lot of tidying and adjusting to import and join the two data frames. This included converting the population data to numeric, because the commas separating the thousands places were causing the read.csv() command to import population numbers as characters. I needed population numbers to be numeric so I could perform division to calculate growth_per_capita.
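The comma clean-up can be sketched like this. The two rows are copied from the population preview earlier in the report, and the helper name parse_count is hypothetical.

```r
# Population columns arrive as text because of the thousands separators
raw <- data.frame(
  Geographic_Area = c("United States", "Alabama"),
  Est_2014        = c("318,857,056", "4,849,377"),
  stringsAsFactors = FALSE
)

# Strip the commas, then convert; as.character() also covers factor columns
parse_count <- function(x) {
  as.numeric(gsub(",", "", as.character(x), fixed = TRUE))
}

raw$Est_2014 <- parse_count(raw$Est_2014)
str(raw)
```

Calling as.numeric() directly on the comma-separated strings would return NA with a warning, which is why the commas have to be removed first.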

Bivariate Plots Section

The structure of the two datasets: 1. State population and aggregate growth of companies 2. 5000 fastest growing companies and the attributes that describe them

Scatter Matrix plots to understand the relationships between variables in the two datasets

## 'data.frame':    5000 obs. of  16 variables:
##  $ row_num          : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ id               : int  22890 25747 25643 26098 26182 22913 22937 25413 26079 25861 ...
##  $ rank             : Ord.factor w/ 5000 levels "5000"<"4999"<..: 5000 4999 4998 4997 4996 4995 4994 4993 4992 4991 ...
##  $ workers          : int  227 191 145 62 92 50 129 130 264 11 ...
##  $ company          : Factor w/ 5000 levels "(add)ventures",..: 1725 3569 3651 4211 79 3520 1094 3357 4703 1826 ...
##  $ url              : Factor w/ 5000 levels "@properties",..: 1725 3569 3647 4211 76 3520 1094 3357 4703 1826 ...
##  $ state_l          : Factor w/ 51 levels "Alabama","Alaska",..: 5 5 48 5 22 20 5 3 38 34 ...
##  $ state_s          : Factor w/ 51 levels "AK","AL","AR",..: 5 5 47 5 20 22 5 4 38 28 ...
##  $ city             : Factor w/ 1352 levels "Acton","Ada",..: 355 355 37 930 737 46 1142 1103 979 1325 ...
##  $ metro            : Factor w/ 326 levels "","Adrian MI",..: 171 171 314 262 39 163 261 226 229 321 ...
##  $ growth_percentage: num  158957 57348 55460 26043 20690 ...
##  $ revenue2014      : num  195640000 82640563 85076502 35293000 77652360 ...
##  $ industry         : Factor w/ 25 levels "Advertising & Marketing",..: 5 11 2 23 24 7 13 13 25 7 ...
##  $ yrs_on_list      : int  2 1 1 1 1 2 2 1 1 1 ...
##  $ revenue2013      : num  123000 143853 153125 135000 373500 ...
##  $ growth_dollar    : num  195517000 82496710 84923377 35158000 77278860 ...
## 'data.frame':    5000 obs. of  7 variables:
##  $ workers          : int  227 191 145 62 92 50 129 130 264 11 ...
##  $ growth_percentage: num  158957 57348 55460 26043 20690 ...
##  $ revenue2014      : num  195640000 82640563 85076502 35293000 77652360 ...
##  $ industry         : Factor w/ 25 levels "Advertising & Marketing",..: 5 11 2 23 24 7 13 13 25 7 ...
##  $ yrs_on_list      : int  2 1 1 1 1 2 2 1 1 1 ...
##  $ revenue2013      : num  123000 143853 153125 135000 373500 ...
##  $ growth_dollar    : num  195517000 82496710 84923377 35158000 77278860 ...

Which state has greatest revenue growth per capita in 2014?

The state growth in dollars shows most states clustered in the same area under $7.5 Billion. However, there are 2 states, California and Texas, with an extremely large amount of growth at $17-18 Billion. But when looking at state growth in dollars per capita, the top two states are Virginia and Colorado, with Texas trailing closely behind Colorado. How does population affect growth? Future plots should explore the relationship between state population and revenue as well as state population and growth to uncover other trends.

Plot: Revenue growth by state, normalized by population

As concluded earlier, Virginia, Colorado, and Texas have the fastest growth in dollars per capita. California is trailing at #13.

What’s the relationship between Revenue and Growth?

Note: Refer to the frequency polygon and density plots in the Univariate section to see the differences in distribution between revenue in 2014 and percentage growth.

Revenue in 2014 and growth in dollars are strongly correlated, with a Pearson’s r of 0.95.

## 
##  Pearson's product-moment correlation
## 
## data:  companies$revenue2014 and companies$growth_dollar
## t = 208.4471, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9440788 0.9498019
## sample estimates:
##       cor 
## 0.9470155
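A test like the one above can be run with base R's cor.test(). The sketch below uses only the first five rows printed earlier in this report, so its estimate will not match the 0.947 obtained from all 5000 companies.

```r
# First five rows of revenue2014 and growth_dollar from the outputs above
revenue2014   <- c(195640000, 82640563, 85076502, 35293000, 77652360)
growth_dollar <- c(195517000, 82496710, 84923377, 35158000, 77278860)

# Pearson's product-moment correlation with a two-sided test
result <- cor.test(revenue2014, growth_dollar, method = "pearson")

print(result$estimate)  # sample correlation
print(result$conf.int)  # 95 percent confidence interval
print(result$p.value)
```

Because growth_dollar is computed as revenue2014 minus revenue2013, the two columns share most of their variation by construction, so a high correlation between them is partly mechanical.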

Revenue in 2013 and growth in dollars are also highly correlated, with a Pearson’s r of 0.77. This relationship is weaker than the one between revenue in 2014 and growth, which is expected since the dataset focuses on the fastest growing companies in 2014. Growth in 2014 is measured relative to revenue in 2013, so a high correlation between the two is also expected.

## 
##  Pearson's product-moment correlation
## 
## data:  companies$revenue2013 and companies$growth_dollar
## t = 84.3823, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7548497 0.7777247
## sample estimates:
##       cor 
## 0.7665302

Plots of growth (in dollars and in percentage) against 2013 revenue, with a log10 transformation, have a very unusual fan-shaped distribution, in contrast to the 2014 revenue plots, which are more cone-shaped. This could have the same cause as the revenue histogram appearing truncated on the left side. The fan shape is likely a result of the fact that companies with 2013 revenue less than $1,000,000 were excluded from the list.

Revenue and growth percentage

Revenue by Years on List

Plots of the number of years a company has appeared on the Inc. 5000 list.

## Warning in loop_apply(n, do.ply): Removed 64 rows containing missing values
## (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  companies$yrs_on_list and companies$revenue2014
## t = 10.7641, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1233182 0.1775016
## sample estimates:
##      cor 
## 0.150523

Relationship between Industry and Growth

## Warning in loop_apply(n, do.ply): Removed 124 rows containing missing
## values (geom_point).

## Warning in loop_apply(n, do.ply): Removed 358 rows containing non-finite
## values (stat_boxplot).

## Warning in loop_apply(n, do.ply): Removed 64 rows containing missing values
## (geom_point).

## Warning in loop_apply(n, do.ply): Removed 21 rows containing missing values
## (geom_point).

Relationship between workers and growth

## Warning in loop_apply(n, do.ply): Removed 830 rows containing missing
## values (geom_point).

## Warning in loop_apply(n, do.ply): Removed 816 rows containing missing
## values (stat_summary).
## Warning in loop_apply(n, do.ply): Removed 827 rows containing missing
## values (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  companies$workers and companies$growth_dollar
## t = 16.0263, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1945545 0.2472854
## sample estimates:
##       cor 
## 0.2210815
## Warning in loop_apply(n, do.ply): Removed 852 rows containing missing
## values (stat_summary).
## Warning in loop_apply(n, do.ply): Removed 862 rows containing missing
## values (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  companies$workers and companies$growth_percentage
## t = -0.8873, df = 4998, p-value = 0.3749
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04025575  0.01517411
## sample estimates:
##         cor 
## -0.01255046

The relationship between workers and growth is quite weak: Pearson’s r is 0.22 for growth in dollars and -0.01 (p = 0.37) for percentage growth. This makes sense as the number of workers isn’t necessarily predictive of growth compared to revenue or even the product being produced.

Relationship between growth and years on list

## Warning in loop_apply(n, do.ply): Removed 69 rows containing missing values
## (geom_point).

## Warning in loop_apply(n, do.ply): Removed 469 rows containing missing
## values (geom_point).

Other exploratory bivariate plots

## Warning in loop_apply(n, do.ply): Removed 65 rows containing missing values
## (geom_point).

## Warning in loop_apply(n, do.ply): Removed 362 rows containing missing
## values (geom_point).

Growth in terms of state population

Growth in Dollars

## Warning in loop_apply(n, do.ply): Removed 9 rows containing missing values
## (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  state_growth$state_population2014 and state_growth$state_growth_dollar
## t = 18.5743, df = 49, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8895717 0.9630005
## sample estimates:
##       cor 
## 0.9357539

Generally the trend appears to be that the higher the population, the greater the total state growth in dollars. This is supported by the high Pearson’s r of 0.94 between state population in 2014 and state growth in dollars.

Growth per capita

## Warning in loop_apply(n, do.ply): Removed 1 rows containing missing values
## (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  state_growth$state_population2014 and state_growth$growth_per_capita
## t = 2.6938, df = 49, p-value = 0.009647
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.09274621 0.57756852
## sample estimates:
##       cor 
## 0.3591502

On the other hand, the relationship between state population and growth per capita is much weaker, with a Pearson’s r of 0.36.

Relationship between Industry and Years on List

## Warning in loop_apply(n, do.ply): Removed 37 rows containing non-finite
## values (stat_boxplot).

## Warning in loop_apply(n, do.ply): Removed 37 rows containing non-finite
## values (stat_ydensity).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

There is a very strong relationship between 2014 revenue and growth. There is a weaker but still very strong relationship between 2013 revenue and growth.

The other features were far more weakly related to growth than revenue was.

It looks like there is some positive relationship between state population and growth per capita with a Pearson’s r correlation of 0.36. However, the scatter plot of the two variables didn’t look promising so I think analysis of state population and growth is a dead end.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The relationship between workers and growth is weak. The plots don’t follow a clear positive or negative trend, and the Pearson’s r of 0.22 between workers and growth in dollars suggests only a slight positive relationship.

Most companies have been on the fastest growing companies list fewer than 3 times. This suggests that there isn’t much repeat of past growth performance, i.e. it is difficult to be among the fastest growing companies more than 7 times, which makes sense because fast growth is difficult to sustain. It is not clear whether the number of years a company has been on the list affects growth. Generally it looks like there isn’t much of a trend here.

What was the strongest relationship you found?

The strongest relationship observed was between 2014 revenue and growth in dollars (r value of 0.95), closely followed by the relationship between 2014 state population and total state growth in dollars (r value of 0.94).

Multivariate Plots Section

I removed 5 outliers whose revenues were greater than $3 Billion so I could see the general trend in the data. However, I would also like to look at only the outliers to see what trends are observable in their growth and the industries they are in.

Top 5 Companies by Growth and Revenue colored by Industry

Recoding Industry to drill deeper

Industry has too many values (25 levels). I need to group the existing industries into overarching categories to plot using color. Otherwise, the values will be too difficult to distinguish on a scatterplot. I grouped industries based on the Global Industry Classification Standard (http://en.wikipedia.org/wiki/Global_Industry_Classification_Standard) developed by MSCI and S&P.

##  [1] "Advertising & Marketing"      "Business Products & Services"
##  [3] "Computer Hardware"            "Construction"                
##  [5] "Consumer Products & Services" "Education"                   
##  [7] "Energy"                       "Engineering"                 
##  [9] "Environmental Services"       "Financial Services"          
## [11] "Food & Beverage"              "Government Services"         
## [13] "Health"                       "Human Resources"             
## [15] "Insurance"                    "IT Services"                 
## [17] "Logistics & Transportation"   "Manufacturing"               
## [19] "Media"                        "Real Estate"                 
## [21] "Retail"                       "Security"                    
## [23] "Software"                     "Telecommunications"          
## [25] "Travel & Hospitality"
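One way to do this kind of recoding is with a named lookup vector, sketched below. The sector names are the ones used later in this report, but the specific industry-to-sector assignments and the "Other" fallback are assumptions for illustration, not the mapping actually used here.

```r
# A few of the 25 industry levels listed above
industry <- factor(c("Health", "Energy", "IT Services", "Software",
                     "Financial Services", "Telecommunications", "Retail"))

# Hypothetical mapping from industry level to a broader sector
sector_lookup <- c(
  "Health"             = "Healthcare",
  "Energy"             = "Energy",
  "IT Services"        = "IT",
  "Software"           = "IT",
  "Computer Hardware"  = "IT",
  "Financial Services" = "Financials",
  "Insurance"          = "Financials",
  "Telecommunications" = "Telecommunication Services"
)

# Look up each company's sector; unmapped industries fall back to "Other"
sector <- unname(sector_lookup[as.character(industry)])
sector[is.na(sector)] <- "Other"
sector <- factor(sector)

print(table(sector))
```

Keeping the mapping in one named vector makes it easy to review every assignment in a single place and to spot industries that were left unmapped.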

How are the top 3 industry categories different from the other industries?

What is the relationship of workers in the different industries?

Let’s take a look at the relationship among the variables workers, revenue, growth, and other factors that might shed light on the growth of companies.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      24      50     209     125   34220

## 
##  Pearson's product-moment correlation
## 
## data:  companies$workers and companies$growth_dollar
## t = 16.0263, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1945545 0.2472854
## sample estimates:
##       cor 
## 0.2210815

Companies that have fewer employees also tend to have lower revenue and therefore lower growth in dollars (since growth is derived from revenue). There is a consistent trend of the companies with the greatest number of employees also having the highest revenue and growth. The second plot, with revenue and percentage growth colored by workers, shows that companies with fewer employees also have lower percentage growth. This conclusion dispels the theory I proposed earlier that smaller companies with fewer employees and lower revenue in dollars have greater percentage growth because every incremental revenue dollar accounts for a greater degree of growth.

Rather, the trend seems to point to the fact that companies with the greatest number of employees also tend to generate the most revenue and growth, calculated in both dollars and percentage. This conclusion is further supported by the scatter plot that shows the relationship between revenue and growth per worker. There is a quite clear grouping of revenue and growth per worker based on the workers groups.
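Grouping companies by number of workers can be sketched with base R's cut(). The break points below are taken from the quartiles in the summary output (24, 50, 125) and are an assumption; the groups used for the plots in this report may be defined differently.

```r
# First ten worker counts from the structure output above
workers <- c(227, 191, 145, 62, 92, 50, 129, 130, 264, 11)

# Bin into four groups; breaks follow the quartiles of workers (assumed)
workers_group <- cut(
  workers,
  breaks = c(0, 24, 50, 125, Inf),
  labels = c("0-24", "25-50", "51-125", "126+"),
  include.lowest = TRUE,  # keep companies reporting 0 workers in the first bin
  right = TRUE            # intervals are closed on the right, e.g. (24, 50]
)

print(table(workers_group))
```

The resulting factor can be mapped to color in a scatter plot. Per-worker measures such as revenue divided by workers need extra care, because companies reporting 0 workers produce infinite values.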

State Comparison of Growth by Revenue

The top states are all very similar in their distribution of growth by revenue. Some outliers differ but the boxplots are generally almost the same suggesting that there is not much of a difference across the different quartiles.

There isn’t a clear trend linking states and industries. Among the top states, no one industry appears to be overrepresented. The growth by revenue trends also seem consistent across all states: strong positive relationships. No states have increasing revenues but downward growth. State probably doesn’t play a big role in determining growth for a company. This is surprising, as I assumed states like California and perhaps Texas or New York would be favorable towards the tech industries.

What effect does industry have on growth?

This plot shows the effect of industry on revenue and growth, colored by number of employees. The top 3 industries have the densest points, so their sample sizes are larger. But there is no consistent pattern across industries when it comes to workers, i.e. no industry has particularly small or particularly large numbers of employees. All industries have a range of employee numbers. However, employee numbers do support the previous finding that the groups with fewer employees tend to have lower revenue and lower growth.

Multivariate Analysis

This plot also supports the previous conclusion that the Healthcare, Energy, and to some degree Industrials and Financials industries exceed the other industries in the growth and revenue they achieved. There are companies in these industries in the upper-right corner, which the IT, Consumer Sector, and Telecommunication Services industries don’t reach.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Previous analysis showed there is a strong relationship between revenue and growth. Multivariate plots also showed that the industry matters: Healthcare, Energy, and Industrials achieve the highest degree of revenue and growth compared to IT, Consumer Sector, and Telecommunication Services industries. The Financials industry can be grouped with the higher revenue and growth group but it was represented as one of the outliers in the revenue versus growth plots.

The greater the number of employees at a company, the more likely the company is to have greater revenue and growth.

Were there any interesting or surprising interactions between features?

There is a positive relationship between revenue and growth, which dispels my earlier hypothesis that smaller companies, despite having less growth in dollars, would show higher percentage growth. The companies with the greatest revenue also had the highest growth, measured in both dollars and percentage.

The multivariate plots show that each of the top states has many different industries represented. I thought California would stand out with more Information Technology companies and Texas would have more Energy companies. But these stereotypes don’t seem to hold: the top states generally have a diverse range of industries, which suggests that they have many fast-growing companies in different sectors rather than companies specializing in one or two specific industries.
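One way to check whether a state specializes in an industry is to tabulate industry shares within each state. The sketch below is illustrative only: the ten rows are made up, and the column names `state_s` and `industry` are assumed to match the data frame.

```r
# Made-up rows, for illustration only
toy <- data.frame(
  state_s  = c("CA", "CA", "CA", "CA", "TX", "TX", "TX", "NY", "NY", "NY"),
  industry = c("IT Services", "Software", "Health", "Energy",
               "Energy", "IT Services", "Construction",
               "Financial Services", "Software", "Health")
)

# Count companies per state and industry, then convert each row to shares
counts <- table(toy$state_s, toy$industry)
shares <- prop.table(counts, margin = 1)  # each row (state) sums to 1

# Largest single-industry share per state: values near 1 would mean
# specialization, small values mean a diverse mix
top_share <- apply(shares, 1, max)

print(round(shares, 2))
print(round(top_share, 2))
```

On the real data, a state dominated by one industry would show a `top_share` far above the others; roughly equal values across states would support the diversity reading above.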

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

## Calls:
## m1: lm(formula = I(growth_dollar) ~ I(revenue2014), data = companies)
## m2: lm(formula = I(growth_dollar) ~ I(revenue2014) + workers, data = companies)
## m3: lm(formula = I(growth_dollar) ~ I(revenue2014) + workers + industry, 
##     data = companies)
## m4: lm(formula = I(growth_dollar) ~ I(revenue2014) + workers + industry + 
##     state_l, data = companies)
## 
## ===============================================================================================================================================================
##                                                                           m1                      m2                      m3                      m4           
## ---------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept)                                                         1374603.674**           1817698.230***          2007177.219              1065470.861       
##                                                                     (478521.520)            (480417.066)           (1538968.379)            (4733320.471)      
## I(revenue2014)                                                            0.534***                0.539***                0.538***                 0.538***    
##                                                                          (0.003)                 (0.003)                 (0.003)                  (0.003)      
## workers                                                                                       -3134.358***            -3089.363***             -3080.919***    
##                                                                                                (447.373)               (452.749)                (454.386)      
## industry: Business Products & Services/Advertising & Marketing                                                     -3119393.552             -3021343.377       
##                                                                                                                    (2200078.743)            (2220804.403)      
## industry: Computer Hardware/Advertising & Marketing                                                                -2956015.132             -1562875.761       
##                                                                                                                    (5822186.461)            (5848354.253)      
## industry: Construction/Advertising & Marketing                                                                     -5316741.855             -5449915.348       
##                                                                                                                    (2790909.334)            (2812350.822)      
## industry: Consumer Products & Services/Advertising & Marketing                                                      4605429.609              4205288.581       
##                                                                                                                    (2652069.212)            (2672826.611)      
## industry: Education/Advertising & Marketing                                                                        -1322579.756             -1642664.045       
##                                                                                                                    (4013726.247)            (4033584.236)      
## industry: Energy/Advertising & Marketing                                                                            9430816.841**            8727480.348*      
##                                                                                                                    (3435529.337)            (3490578.903)      
## industry: Engineering/Advertising & Marketing                                                                       -229146.377              -243445.134       
##                                                                                                                    (4313868.518)            (4348440.153)      
## industry: Environmental Services/Advertising & Marketing                                                           -1925372.873             -1686596.075       
##                                                                                                                    (4713620.455)            (4758270.019)      
## industry: Financial Services/Advertising & Marketing                                                                1834675.721              1601768.627       
##                                                                                                                    (2495622.481)            (2517124.263)      
## industry: Food & Beverage/Advertising & Marketing                                                                   2227445.858              2151869.828       
##                                                                                                                    (3244201.932)            (3261328.033)      
## industry: Government Services/Advertising & Marketing                                                               1009980.742              1069036.375       
##                                                                                                                    (2724844.977)            (2970302.187)      
## industry: Health/Advertising & Marketing                                                                             980210.080               758142.041       
##                                                                                                                    (2285744.618)            (2309605.321)      
## industry: Human Resources/Advertising & Marketing                                                                    -40987.327              -256316.214       
##                                                                                                                    (2806357.091)            (2830929.000)      
## industry: Insurance/Advertising & Marketing                                                                        -1970000.651             -1841535.891       
##                                                                                                                    (4287722.067)            (4314119.050)      
## industry: IT Services/Advertising & Marketing                                                                      -1227918.923             -1174120.639       
##                                                                                                                    (1957231.860)            (1983913.330)      
## industry: Logistics & Transportation/Advertising & Marketing                                                       -1653283.637             -1696519.910       
##                                                                                                                    (3243380.467)            (3279961.492)      
## industry: Manufacturing/Advertising & Marketing                                                                    -2845128.785             -2216298.539       
##                                                                                                                    (2725135.668)            (2760189.801)      
## industry: Media/Advertising & Marketing                                                                             -446152.119              -203801.322       
##                                                                                                                    (4498253.851)            (4521311.731)      
## industry: Real Estate/Advertising & Marketing                                                                       3066524.290              2825411.406       
##                                                                                                                    (3250742.304)            (3273054.737)      
## industry: Retail/Advertising & Marketing                                                                             333955.676               537008.862       
##                                                                                                                    (2780591.476)            (2797243.538)      
## industry: Security/Advertising & Marketing                                                                           994491.090               229456.741       
##                                                                                                                    (4269853.654)            (4297055.677)      
## industry: Software/Advertising & Marketing                                                                             6740.409              -147974.928       
##                                                                                                                    (2353508.679)            (2373931.684)      
## industry: Telecommunications/Advertising & Marketing                                                                1292244.152               872129.584       
##                                                                                                                    (3214001.645)            (3233572.572)      
## industry: Travel & Hospitality/Advertising & Marketing                                                             -5997248.023             -6038258.490       
##                                                                                                                    (4436137.037)            (4489594.078)      
## state_l: Alaska/Alabama                                                                                                                     -2607516.435       
##                                                                                                                                            (33113466.400)      
## state_l: Arizona/Alabama                                                                                                                    -1886274.954       
##                                                                                                                                             (5495388.746)      
## state_l: Arkansas/Alabama                                                                                                                   -2387133.685       
##                                                                                                                                            (11803045.516)      
## state_l: California/Alabama                                                                                                                  1830031.578       
##                                                                                                                                             (4626353.336)      
## state_l: Colorado/Alabama                                                                                                                    4963593.705       
##                                                                                                                                             (5343438.143)      
## state_l: Connecticut/Alabama                                                                                                                 1412658.825       
##                                                                                                                                             (6542115.319)      
## state_l: Delaware/Alabama                                                                                                                     720115.417       
##                                                                                                                                             (8930072.472)      
## state_l: District of Columbia/Alabama                                                                                                         234794.250       
##                                                                                                                                             (6628397.086)      
## state_l: Florida/Alabama                                                                                                                     1132774.546       
##                                                                                                                                             (4829854.286)      
## state_l: Georgia/Alabama                                                                                                                      631087.384       
##                                                                                                                                             (4999628.870)      
## state_l: Hawaii/Alabama                                                                                                                       138302.593       
##                                                                                                                                            (15363414.166)      
## state_l: Idaho/Alabama                                                                                                                       2119383.852       
##                                                                                                                                            (10479894.224)      
## state_l: Illinois/Alabama                                                                                                                     824965.051       
##                                                                                                                                             (4942887.054)      
## state_l: Indiana/Alabama                                                                                                                     2784133.194       
##                                                                                                                                             (5883750.416)      
## state_l: Iowa/Alabama                                                                                                                        -231412.095       
##                                                                                                                                             (7465662.381)      
## state_l: Kansas/Alabama                                                                                                                      -518923.123       
##                                                                                                                                             (7044628.801)      
## state_l: Kentucky/Alabama                                                                                                                    1194550.328       
##                                                                                                                                             (7397216.024)      
## state_l: Louisiana/Alabama                                                                                                                     80548.161       
##                                                                                                                                             (6765981.146)      
## state_l: Maine/Alabama                                                                                                                       5609960.141       
##                                                                                                                                             (9836688.870)      
## state_l: Maryland/Alabama                                                                                                                    1058750.260       
##                                                                                                                                             (5318098.596)      
## state_l: Massachusetts/Alabama                                                                                                               -260999.279       
##                                                                                                                                             (5098812.725)      
## state_l: Michigan/Alabama                                                                                                                     -41141.703       
##                                                                                                                                             (5311293.644)      
## state_l: Minnesota/Alabama                                                                                                                  -5713901.278       
##                                                                                                                                             (5601582.586)      
## state_l: Mississippi/Alabama                                                                                                                -2529426.506       
##                                                                                                                                            (10485781.564)      
## state_l: Missouri/Alabama                                                                                                                    -700946.038       
##                                                                                                                                             (5908223.114)      
## state_l: Montana/Alabama                                                                                                                    -2855999.433       
##                                                                                                                                            (15366070.812)      
## state_l: Nebraska/Alabama                                                                                                                    1241909.562       
##                                                                                                                                             (7637652.213)      
## state_l: Nevada/Alabama                                                                                                                     -1244580.322       
##                                                                                                                                             (7391679.863)      
## state_l: New Hampshire/Alabama                                                                                                               3853167.050       
##                                                                                                                                             (8296964.814)      
## state_l: New Jersey/Alabama                                                                                                                  2443072.845       
##                                                                                                                                             (5138906.913)      
## state_l: New Mexico/Alabama                                                                                                                   767138.638       
##                                                                                                                                            (14112750.199)      
## state_l: New York/Alabama                                                                                                                     535380.615       
##                                                                                                                                             (4805534.735)      
## state_l: North Carolina/Alabama                                                                                                               626474.932       
##                                                                                                                                             (5226074.431)      
## state_l: North Dakota/Alabama                                                                                                                -147291.799       
##                                                                                                                                            (14125189.830)      
## state_l: Ohio/Alabama                                                                                                                         575896.731       
##                                                                                                                                             (5110382.385)      
## state_l: Oklahoma/Alabama                                                                                                                   -2379130.979       
##                                                                                                                                             (7476180.505)      
## state_l: Oregon/Alabama                                                                                                                       209199.022       
##                                                                                                                                             (6139400.096)      
## state_l: Pennsylvania/Alabama                                                                                                                1051972.893       
##                                                                                                                                             (5049539.864)      
## state_l: Puerto Rico/Alabama                                                                                                                 2168871.361       
##                                                                                                                                            (23625913.352)      
## state_l: Rhode Island/Alabama                                                                                                                1885067.291       
##                                                                                                                                             (9126225.490)      
## state_l: South Carolina/Alabama                                                                                                               823112.477       
##                                                                                                                                             (6276706.842)      
## state_l: South Dakota/Alabama                                                                                                               -1503832.357       
##                                                                                                                                            (23633610.162)      
## state_l: Tennessee/Alabama                                                                                                                  -3674856.239       
##                                                                                                                                             (5779448.024)      
## state_l: Texas/Alabama                                                                                                                       5993154.622       
##                                                                                                                                             (4745857.125)      
## state_l: Utah/Alabama                                                                                                                        2204590.821       
##                                                                                                                                             (5686504.560)      
## state_l: Vermont/Alabama                                                                                                                    -1400313.358       
##                                                                                                                                            (19498491.091)      
## state_l: Virginia/Alabama                                                                                                                    1075027.639       
##                                                                                                                                             (4812382.520)      
## state_l: Washington/Alabama                                                                                                                  1491777.894       
##                                                                                                                                             (5473415.760)      
## state_l: West Virginia/Alabama                                                                                                              -2241460.914       
##                                                                                                                                            (17017182.657)      
## state_l: Wisconsin/Alabama                                                                                                                 -16355500.032**     
##                                                                                                                                             (5942247.357)      
## ---------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared                                                                         0.897                   0.898                   0.899                   0.899
## adj. R-squared                                                                    0.897                   0.898                   0.898                   0.898
## sigma                                                                      32926060.992            32768803.256            32741283.113            32773391.395
## F                                                                             43450.196               21958.659                1693.215                 578.653
## p                                                                                 0.000                   0.000                   0.000                   0.000
## Log-likelihood                                                               -93642.568              -93618.130              -93601.893              -93581.531
## Deviance                                                        5418459211168380928.000 5365750950705011712.000 5331014325546490880.000 5287770588417689600.000
## AIC                                                                          187291.135              187244.259              187259.785              187319.061
## BIC                                                                          187310.687              187270.328              187442.267              187827.402
## N                                                                              5000                    5000                    5000                    5000    
## ===============================================================================================================================================================

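I built a linear model using the variables that my analysis highlighted as having some effect on growth: revenue, workers, industry, and state. The model appears relatively strong: R-squared is about 0.9, so these variables account for about 90% of the variance in growth in dollars. Revenue is by far the strongest predictor, since revenue alone (m1) already gives an R-squared of 0.897. Adding state (m4) did not change R-squared at three decimal places relative to m3, which supports the finding that state has little impact on a company's growth. AIC and BIC are both lowest for m2 (revenue and workers), so the industry and state terms do not improve the fit enough to justify their extra coefficients. One limitation is that growth in dollars is derived from revenue, so part of the high R-squared is built into how the response is defined.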
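The nested comparison reported in the table can be sketched in base R as follows. This is not the original code: the data are simulated, the factor levels are a small made-up subset, and the summary is assembled by hand rather than with the `mtable` output shown above. Only the model formulas mirror the calls listed in the table.

```r
set.seed(42)

# Simulated stand-in for the companies data frame; all values are made up
n <- 500
toy <- data.frame(
  revenue2014 = rlnorm(n, meanlog = 16, sdlog = 1),
  workers     = rpois(n, lambda = 100),
  industry    = factor(sample(c("Energy", "Health", "Software"), n, replace = TRUE)),
  state_l     = factor(sample(c("California", "Texas", "New York"), n, replace = TRUE))
)
toy$growth_dollar <- 0.5 * toy$revenue2014 + rnorm(n, sd = 1e6)

# Nested models, each adding one term to the previous one
m1 <- lm(growth_dollar ~ revenue2014, data = toy)
m2 <- update(m1, . ~ . + workers)
m3 <- update(m2, . ~ . + industry)
m4 <- update(m3, . ~ . + state_l)

# Collect fit statistics side by side
models <- list(m1 = m1, m2 = m2, m3 = m3, m4 = m4)
comparison <- data.frame(
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC    = sapply(models, AIC),
  BIC    = sapply(models, BIC)
)
print(comparison)

# Sequential F-tests: does each added term reduce residual variance?
print(anova(m1, m2, m3, m4))
```

R-squared can never decrease when terms are added to a nested least-squares model, so adjusted R-squared, AIC, and BIC, which penalize extra coefficients, are the more informative columns when deciding whether industry or state earns its place.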


Final Plots and Summary

Plot One

Description One

After a log transformation, the distributions of both growth and revenue appear approximately normal. Growth is closer to normal than revenue, which appears to be truncated on the left side. This could be a result of the dataset containing only 5000 companies.
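The effect of the log transformation can be illustrated with a small simulation. The sketch below draws made-up, right-skewed revenue values (it does not use the Inc. 5000 data) and compares sample skewness before and after `log10`; the `skewness` helper is defined here for the example and is not from the original analysis.

```r
set.seed(7)

# Made-up right-skewed revenues, standing in for the real column
revenue <- rlnorm(5000, meanlog = 16, sdlog = 1.2)

# Simple moment-based sample skewness (hypothetical helper for this sketch)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

raw_skew <- skewness(revenue)         # strongly positive: long right tail
log_skew <- skewness(log10(revenue))  # near zero: roughly symmetric

cat("skewness, raw scale:  ", round(raw_skew, 2), "\n")
cat("skewness, log10 scale:", round(log_skew, 2), "\n")
```

A skewness near zero on the log scale is consistent with an approximately normal shape. A left-truncated variable would instead keep a positive skew even after the transformation.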

Plot Two

Description Two

Overall, companies in non-IT industries show greater revenue and growth than companies in the IT industries (including IT Services, software, and computer hardware). This trend is weaker at lower levels of revenue and growth, where IT and non-IT companies are more competitive. In fact, IT industries exceed non-IT industries in growth by revenue at around the $150-200 million mark, with both revenue and growth measured in dollars.

But at the highest levels of revenue, around $1.2-2 billion, non-IT industries far outperform IT industries in terms of growth by revenue. This is likely due to the capital-intensive, large-scale operations involved in the healthcare, energy, and construction industries.

I decided not to perform a log10 transformation on the x and y axes for this plot because I was more interested in the upward trends suggested by the smoothing function. I wanted to see where the line would go beyond $2 billion, rather than focus on what is under $250 million.
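The IT versus non-IT comparison can be sketched in base R as follows. The data are simulated, the industry labels and the `sector` and `growth_dollars` columns are assumptions introduced for the example, and `lowess` stands in for whatever smoother the original plot used. Note that `lowess` only smooths within the observed range of revenue and does not extrapolate.

```r
# Simulated data with hypothetical industry labels.
set.seed(3)
n <- 300
industry <- sample(
  c("IT Services", "Software", "Computer Hardware",
    "Health", "Energy", "Construction"),
  n, replace = TRUE
)
revenue <- rlnorm(n, meanlog = 17, sdlog = 1)
growth_dollars <- revenue * runif(n, 0.3, 0.9)   # assumption of the sketch

# Collapse industries into two sectors.
it_industries <- c("IT Services", "Software", "Computer Hardware")
sector <- ifelse(industry %in% it_industries, "IT", "Non-IT")
df <- data.frame(revenue, growth_dollars, sector)

# Fit one smoothed curve per sector on the untransformed (dollar) scale.
curves <- lapply(split(df, df$sector), function(d) {
  lowess(d$revenue, d$growth_dollars, f = 2 / 3)
})

# Draw the points and both curves when run in an interactive session.
if (interactive()) {
  plot(df$revenue, df$growth_dollars, col = "grey",
       xlab = "Revenue ($)", ylab = "Growth ($)")
  lines(curves[["IT"]], col = "blue")
  lines(curves[["Non-IT"]], col = "red")
}
```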

Plot Three

Description Three

Growth per worker is a better indicator of a company's efficiency because it levels the playing field between large and small companies. The plot shows that companies with fewer employees tend to be more efficient: they generate less revenue but still reach the highest strata of growth. In comparison, larger companies with more employees and greater revenue generally achieve a lower level of growth. The trend is a continuum rather than a sharp divide: the spread from left to right indicates that some smaller companies with fewer employees are not as efficient as larger companies with more employees.
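A minimal sketch of the growth-per-worker calculation is shown below. It assumes the metric is growth in dollars divided by the number of workers; the four companies and the `growth_dollars` and `growth_per_worker` column names are hypothetical.

```r
# Four made-up companies of very different sizes.
companies <- data.frame(
  company        = c("A", "B", "C", "D"),
  workers        = c(10, 50, 500, 2000),
  growth_dollars = c(2e6, 5e6, 2e7, 4e7)
)

# Normalize growth by headcount so large and small firms are comparable.
companies$growth_per_worker <- companies$growth_dollars / companies$workers

# Rank from most to least efficient.
ranked <- companies[order(-companies$growth_per_worker), ]
print(ranked)
```

In this toy example the largest company has the highest growth in dollars but the lowest growth per worker, which is the pattern the plot describes.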


Reflection

The companies dataset contains information on the 5000 fastest growing private companies in the U.S. in 2014. I began my analysis by computing descriptive statistics to understand what the variables in the dataset mean and how they are distributed.

I started with the dominant question: what affects company growth? Since this dataset is about the fastest growing private companies, I was looking for variables that might shed light on what increases or decreases growth. I looked at relationships among multiple features and eventually created a linear model using revenue, industry, and workers, which emerged from the plots and analysis as the features with the most influence on growth.

My conclusion is that revenue is the most robust feature in my dataset for predicting growth. It accounts for almost 90% of the variance in growth. The industry and number of employees of a company also have an effect, albeit a much smaller one than revenue.

Where did I run into difficulties in the analysis?

The hardest part was understanding how different variables related to each other and how to tease apart their effect on growth. It was clear from the beginning that revenue has a direct effect on growth since growth is a calculation of the difference in revenue from the previous and current year.

However, I struggled to tease out how much other variables affect growth and whether there is a meaningful relationship among other features, such as industry, state, and workers.

I also found the additional population data helpful in the beginning. However, once it became clear that the state feature played a less meaningful role than other features, I realized state population was a dead end.

Where did I find successes?

The multivariate plots produced a lot of interesting analysis and helped clarify and drill into the relationship between revenue and growth. The plots that included industry and workers in the growth by revenue analysis offered more nuanced insight into what kinds of companies (for example, larger or smaller) in which industries had higher levels of growth.

I also spent a lot of time teasing apart the different perspectives on growth, for example growth measured in dollars, as a percentage, or per worker, to understand the different facets of this feature. This helped to inform later analysis.

How could the analysis be enriched in future work (e.g. additional data and analyses)?

  • More features to describe the companies. I would have liked to have the year the companies were founded, as I wondered whether young companies experienced greater growth than older ones. The years on list feature was a small hint at the answer but far from robust enough to use in my analyses as a marker of company age.
  • Since the state population data was not a main focus, I didn’t spend a lot of time aggregating data from the company dataset to calculate the state-by-state view. With more time, looking at the growth data at a state and even metro level would be interesting, to see if certain states foster more growth because they have larger populations. Or perhaps certain metro areas foster more growth in certain industries.
  • Additional data on tax rates for different locales could shed light on whether some areas in the U.S. foster more growth than others because the tax laws are more favorable to businesses.
  • Since the dataset is looking at private companies, there is little public data about the assets, capital, debt, and other factors related to businesses. If this data were available, it could be interesting to evaluate my hypothesis that the larger companies also generate more revenue and growth because they are in capital-intensive industries.
  • I would also like to test the linear model to predict which companies will have the greatest growth and determine how accurate my model is. As a result, I would also be able to test the assumptions I have made through my analyses and determine if conclusions based on the plots are accurate.
  • Finally, it would be useful to have the Inc. 5000 data from the last 5 years to compare with the 2014 dataset and see how the companies have changed over time. For example, are certain industries more popular in some states over time? Ideally, I would also like to have the metrics on the companies from my current dataset to see if their growth was also correlated with an increase in number of employees or moving offices from one state to another.
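
The model test mentioned in the list above could be approached with a simple holdout split. The following is a minimal sketch on simulated data: the coefficients, the noise level, and the 80/20 split are all assumptions made for illustration, not results from the Inc. 5000 dataset.

```r
# Simulated companies (hypothetical values only).
set.seed(5)
n <- 500
companies <- data.frame(
  revenue = rlnorm(n, meanlog = 16, sdlog = 1),
  workers = rpois(n, 80) + 1
)
companies$growth <- 0.8 * companies$revenue +
  5e3 * companies$workers + rnorm(n, sd = 1e6)

# Hold out 20% of the rows for testing.
train_idx <- sample(n, size = floor(0.8 * n))
train <- companies[train_idx, ]
test  <- companies[-train_idx, ]

# Fit on the training rows only, then predict the held-out rows.
fit  <- lm(growth ~ revenue + workers, data = train)
pred <- predict(fit, newdata = test)

# Out-of-sample error and R-squared.
rmse    <- sqrt(mean((test$growth - pred)^2))
r2_test <- 1 - sum((test$growth - pred)^2) /
  sum((test$growth - mean(test$growth))^2)

print(c(rmse = rmse, r2_test = r2_test))
```

Comparing the out-of-sample R-squared with the in-sample value would show whether the reported 0.9 holds up on companies the model has not seen.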